Support Linear State in SDPA Pipeline #3359
apaniukov wants to merge 31 commits into openvinotoolkit:master from
Conversation
Pull request overview
This PR generalizes the “KV cache state” tracking to support fixed-size linear (and hybrid) cache state in stateful/SDPA-based pipelines by introducing a unified cache state type and propagating it through LLM/VLM/speculative decoding codepaths.
Changes:
- Replaced KVCacheState with CacheState across pipelines and embedders.
- Added cache kind detection (CacheTypes / get_cache_types) and updated cache-trimming behavior to reset for linear caches.
- Wired cache-kind awareness into speculative decoding wrappers and stateful LLM pipeline initialization.
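To make the unified abstraction concrete, here is a minimal sketch of what a CacheTypes / CacheState pair could look like. The type and method names (has_linear(), num_tokens_to_trim, reset_mem_state) follow the snippets quoted later in this review; the exact field layout is an assumption, not the actual openvino.genai implementation.

```cpp
#include <cstddef>

// Hypothetical sketch: which cache kinds a stateful model exposes.
// A classic KV cache has a dynamic sequence axis; a linear/SSM state
// is fully static in shape. A hybrid model has both.
struct CacheTypes {
    bool kv = false;      // classic KV cache detected
    bool linear = false;  // fixed-size linear (conv/SSM) state detected
    bool has_linear() const { return linear; }
    bool is_hybrid() const { return kv && linear; }
};

// Hypothetical sketch: per-pipeline cache bookkeeping that replaces the
// old KVCacheState, carrying the detected kinds alongside trim/reset flags.
struct CacheState {
    CacheTypes types;
    std::size_t num_tokens_to_trim = 0;
    bool reset_mem_state = false;
    bool has_linear() const { return types.has_linear(); }
};
```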
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/cpp/src/visual_language/vision_token_pruning_processor.hpp | Updates pruning processor API to use CacheState. |
| src/cpp/src/visual_language/vision_token_pruning_processor.cpp | Updates pruning processor implementation signature to CacheState. |
| src/cpp/src/visual_language/pipeline.cpp | VLM pipeline now uses CacheState when managing chat history/cache trimming. |
| src/cpp/src/visual_language/phi4mm/classes.cpp | Switches embedder history/cache bookkeeping to m_cache_state. |
| src/cpp/src/visual_language/phi3_vision/classes.cpp | Switches embedder history/cache bookkeeping to m_cache_state. |
| src/cpp/src/visual_language/inputs_embedder.hpp | Replaces stored state from KVCacheState to CacheState. |
| src/cpp/src/visual_language/inputs_embedder.cpp | Updates chat/history alignment and rollback bookkeeping to CacheState. |
| src/cpp/src/utils.hpp | Introduces CacheTypes, CacheState, and get_cache_types() API. |
| src/cpp/src/utils.cpp | Implements cache kind detection and updates trim_kv_cache() behavior for linear caches. |
| src/cpp/src/speculative_decoding/stateful/fast_draft_strategy.hpp | Adds CacheTypes member to infer wrapper. |
| src/cpp/src/speculative_decoding/stateful/fast_draft_strategy.cpp | Initializes CacheTypes and uses it to build CacheState for trimming. |
| src/cpp/src/speculative_decoding/stateful/eagle3_strategy.hpp | Adds CacheTypes member to eagle3 infer wrapper base. |
| src/cpp/src/speculative_decoding/stateful/eagle3_strategy.cpp | Initializes CacheTypes and uses it to build CacheState for trimming. |
| src/cpp/src/lm_encoding.hpp | Updates encoding helpers to accept CacheState. |
| src/cpp/src/lm_encoding.cpp | Updates chat-history alignment logic and cache-state updates for CacheState. |
| src/cpp/src/llm/pipeline_stateful.hpp | Renames stored cache reflection to m_cache_state and renames reset helper. |
| src/cpp/src/llm/pipeline_stateful.cpp | Initializes CacheState from model and propagates it through chat/trim logic. |
Comments suppressed due to low confidence (1)
src/cpp/src/utils.cpp:525
- trim_kv_cache() resets the InferRequest when reset_mem_state is set (or when linear cache needs reset), but it returns without clearing cache_state.reset_mem_state / num_tokens_to_trim or updating the token reflection state. This can leave CacheState inconsistent (stale tokens / repeated resets) for subsequent steps. Consider resetting the CacheState fields when a reset happens (and clearing the token reflection if the underlying model state is cleared).
```cpp
void trim_kv_cache(ov::InferRequest request, CacheState& cache_state, std::optional<AdapterController> adapter_controller) {
    if (
        cache_state.reset_mem_state
        // linear cache stores only the last state, trimming is not possible, so we reset the whole cache in this case
        || (cache_state.num_tokens_to_trim > 0 && cache_state.has_linear())
    ) {
        if (adapter_controller) {
            for (auto& state : request.query_state()) {
                if (!adapter_controller->has_state_name(state.get_name())) {
                    state.reset();
                }
            }
        } else {
            request.reset_state();
        }
        return;
    }
    // ...
}
```
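The bookkeeping fix the comment asks for could look like the stand-in below: clear the flags as part of the reset so a later step neither re-resets nor trims against stale state. CacheState here is a minimal emulation (no InferRequest), and maybe_reset is a hypothetical name for illustration.

```cpp
#include <cstddef>

// Minimal stand-in for the CacheState bookkeeping discussed above;
// field names follow the quoted snippet, the rest is an assumption.
struct CacheState {
    std::size_t num_tokens_to_trim = 0;
    bool reset_mem_state = false;
    bool linear = false;
    bool has_linear() const { return linear; }
};

// Returns true if a full reset was performed. After resetting the model
// state (emulated by the flag), the CacheState fields are cleared so the
// next step does not see stale trim requests or repeat the reset.
inline bool maybe_reset(CacheState& cache_state, bool& model_state_cleared) {
    if (cache_state.reset_mem_state
        || (cache_state.num_tokens_to_trim > 0 && cache_state.has_linear())) {
        model_state_cleared = true;           // stands in for request.reset_state()
        cache_state.reset_mem_state = false;  // clear bookkeeping after the reset
        cache_state.num_tokens_to_trim = 0;
        return true;
    }
    return false;
}
```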
I converted the PR to a draft since it is labeled as WIP.
src/cpp/src/utils.cpp
```cpp
CacheTypes get_cache_types(std::shared_ptr<const ov::Model> model) {
    // "ReadValue" node is cache representation in stateful model
    const std::string state_node_type_name = std::string(ov::op::v6::ReadValue::get_type_info_static().name);
    CacheTypes cache_types;

    for (const auto op : model->get_ops()) {
        // check input size, as in LoRA adapters case it could be 0
        if (op->get_type_name() != state_node_type_name || op->get_input_size() < 1) {
            continue;
        }

        // Shape example: [-1,4,0,64]
        auto shape = op->get_input_partial_shape(0);
        const auto rank = shape.rank().get_length();
        size_t dynamic_axis_count = 0, zero_axis_count = 0;
        for (size_t i = 0; i < rank; i++) {
            if (shape[i].is_dynamic()) {
```
get_cache_types() calls shape.rank().get_length() unconditionally. If a ReadValue input has dynamic rank, get_length() can throw; this would make cache-type detection fail at runtime for some models. Guard with shape.rank().is_dynamic() (skip/continue or handle) before calling get_length(), and similarly avoid iterating dimensions when rank is dynamic.
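The suggested guard can be illustrated with a small stand-in where a dynamic rank is modeled as an empty optional, mirroring how `shape.rank().is_static()` would be consulted before `get_length()`. This emulation and the function name are assumptions for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// nullopt plays the role of a dynamic rank (rank().is_static() == false).
using Rank = std::optional<int64_t>;

// Counts states whose rank is safe to inspect, skipping dynamic-rank ones
// instead of calling get_length() on them (which would throw in OpenVINO).
inline std::size_t count_usable_states(const std::vector<Rank>& state_ranks) {
    std::size_t usable = 0;
    for (const auto& rank : state_ranks) {
        if (!rank.has_value()) {
            continue;  // dynamic rank: skip, do not read the length
        }
        ++usable;
    }
    return usable;
}
```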
src/cpp/src/utils.cpp
```cpp
CacheTypes get_cache_types(std::shared_ptr<const ov::Model> model) {
    // "ReadValue" node is cache representation in stateful model
    const std::string state_node_type_name = std::string(ov::op::v6::ReadValue::get_type_info_static().name);
    CacheTypes cache_types;

    for (const auto op : model->get_ops()) {
```
New cache-type detection (get_cache_types) and the linear-cache reset path in trim_kv_cache() introduce non-trivial behavior that can regress chat/history correctness. There are existing gtests for utils (e.g., tests/cpp/utils.cpp), but no coverage for these new paths; please add unit tests covering KV-only, linear-only, and hybrid detection and verifying reset/trim bookkeeping.
```cpp
// PA backend does not support linear attention states (conv/SSM caches).
if (attention_backend == PA_BACKEND
    && utils::has_linear_attention_states(models_dir, properties)) {
    if (utils::explicitly_requires_paged_attention(user_properties)
        || user_properties.find("ATTENTION_BACKEND") != user_properties.end()) {
        GENAI_WARN("PA backend does not support models with linear attention states. The model may work incorrectly.");
    } else {
        attention_backend = SDPA_BACKEND;
    }
}
```
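The fallback decision above reduces to a small pure function, sketched below with hypothetical enum and parameter names: keep PA only when the model has no linear states or the user explicitly forced the PA backend (in which case a warning is logged), otherwise fall back to SDPA.

```cpp
// Sketch of the backend-selection rule quoted above; names are assumptions.
enum class Backend { PA, SDPA };

inline Backend select_backend(Backend requested,
                              bool has_linear_states,
                              bool user_forced_pa) {
    if (requested == Backend::PA && has_linear_states && !user_forced_pa) {
        return Backend::SDPA;  // silent fallback: PA cannot hold linear states
    }
    return requested;          // honor the explicit choice; caller warns if forced
}
```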
has_linear_attention_states(models_dir, properties) loads the language model to inspect its states, but VLMPipelineImpl(models_dir, ...) will also read/compile the same language model. Consider reusing the already-loaded language_model from VLMPipelineImpl (or reading it once in this constructor and passing it down) to avoid duplicated model reads/parsing at initialization.
src/cpp/src/utils.cpp
```cpp
for (const auto op : model.get_ops()) {
    // check input size, as in LoRA adapters case it could be 0
    if (op->get_type_name() != state_node_type_name || op->get_input_size() < 1) {
        continue;
```
In get_cache_types(), the loop uses for (const auto op : model.get_ops()), which copies each shared_ptr and bumps the atomic ref-count for every op. Using const auto& op avoids that overhead (and is more consistent with performance-sensitive model graph walks).
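The overhead is observable directly: copying a shared_ptr in the loop variable bumps its use_count inside the loop body, while binding by reference does not. The two helper names below are made up for this demonstration.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// `for (const auto op : ops)` copies each shared_ptr: inside the loop the
// use_count is one higher than the container's own reference.
inline long max_use_count_by_value(const std::vector<std::shared_ptr<int>>& ops) {
    long max_count = 0;
    for (const auto op : ops) {   // copy: atomic ref-count increment per element
        max_count = std::max<long>(max_count, op.use_count());
    }
    return max_count;
}

// `const auto&` binds to the stored shared_ptr: no ref-count traffic at all.
inline long max_use_count_by_ref(const std::vector<std::shared_ptr<int>>& ops) {
    long max_count = 0;
    for (const auto& op : ops) {  // reference: use_count stays unchanged
        max_count = std::max<long>(max_count, op.use_count());
    }
    return max_count;
}
```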
Co-authored-by: Vladimir Zlobin <vladimir.zlobin@intel.com>
Pull request overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 3 comments.
```cpp
// Shape example: [-1,4,0,64]
auto shape = op->get_input_partial_shape(0);
const auto rank = shape.rank().get_length();
size_t dynamic_axis_count = 0, zero_axis_count = 0;
for (size_t i = 0; i < rank; i++) {
    if (shape[i].is_dynamic()) {
```
get_cache_types() calls shape.rank().get_length() without checking shape.rank().is_static(). If a model contains a ReadValue with dynamic rank, this will throw/assert inside OpenVINO. Add a guard (e.g., if (!shape.rank().is_static()) continue;) before using get_length() and iterating the rank.
```cpp
// Shape example: [-1,4,0,64]
auto shape = op->get_input_partial_shape(0);
if (shape.rank().get_length() != 4) {
    // kv cache should have 4 dimensions
    continue;
```
get_kv_axes_pos() uses shape.rank().get_length() in the != 4 check without verifying that the rank is static. If rank is dynamic, get_length() can throw/assert. Consider checking shape.rank().is_static() first and skipping ReadValue nodes with dynamic rank.
```diff
  // get reflection of tokens contained in the kv cache
- utils::KVCacheState& get_kv_cache_state();
+ utils::CacheState& get_kv_cache_state();
```
The comment says this returns a reflection of tokens contained in the KV cache, but the type was changed to utils::CacheState and now covers non-KV cache kinds (e.g., linear/SSM state) as well. Consider updating the comment (and possibly the accessor name) so it matches the new semantics.
```cpp
bool has_linear_attention_states(const std::filesystem::path& models_path, const ov::AnyMap& properties) {
    return get_cache_types(*read_model(models_path, properties)).has_linear();
```
Used in VLM constructor.
Model-based constructors are needed in this case for VLMPipelineImpl and VLMContinuousBatchingAdapter; model reading is heavy.
```python
from dataclasses import dataclass
from pathlib import Path
from typing import Type
import subprocess
```
import subprocess will be flagged by Bandit (B404) in this repo (Bandit runs recursively without excluding tests/). Other test files suppress this with # nosec B404 on the import. Consider adding the same suppression here (or refactoring to reuse the existing helper that already carries the suppression) to avoid CI failures.
```diff
- import subprocess
+ import subprocess  # nosec B404
```
Description
Support fixed-size cache state for linear/hybrid attention models.
- Core abstraction: CacheTypes and CacheState (utils.hpp, utils.cpp)
- Stateful LLM pipeline (pipeline_stateful.cpp, lm_encoding.cpp)
- VLM pipeline (pipeline.cpp, inputs_embedder.cpp)
- Speculative decoding (fast_draft_strategy.cpp, eagle3_strategy.cpp)
Tests
CI
CVS-181414
Checklist: